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ABSTRACT 


The Defense Language Institute is responsible for training military and government 
service personnel requiring a foreign language skill. Ten Subskill tests have been 
developed to evaluate the graduating students’ language abilities and to determine if they 
have met the sponsor’s Final Learning Objectives. The Subskill tests in some languages 
have been in place long enough that they can now be studied. This thesis examines these 
Sub skill tests for both Russian and Spanish to determine if the tests have been developed 
and implemented in a manner to efiBciently and consistently discriminate between students 
of different abilities. Three different issues are treated. The ANOVA is used identify 
Subskill tests with significant rater effects and the magnitude of those effects when they 
are present. Item Response Theory is used to examine the Subskill tests at the question 
level in order to identify questions that poorly discriminate between students of different 
abilities. In addition, the ability range that students are tested over is examined. Finally, 
methods using principle components and multiple regression are used to determine which 
tests, if any, can be eliminated with an acceptable loss of information about the students. 
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EXECUTIVE SUMMARY 


The Defense Language Institute is responsible for training military and government 
service personnel requiring a foreign language skill. Ten Subskill tests have been 
developed to evaluate the graduating students’ language abilities and to determine if they 
have met the sponsor’s Final Learning Objectives. The Subskill tests in Russian and 
Spanish have been in place long enough that it is now possible to evaluate them and 
determine if any changes should be made to improve their consistency and efficiency. 

Three different issues are treated. The consistency of Subskill test grading between raters, 
whether the tests can distinguish between students of different abilities, and the degree of 
redundancy of the battery of Subskill tests. 

An analysis of variance shows an inconsistency between the raters score 
assignments for the majority of Subskill tests in Russian and Spanish. In particular, the 
Spanish FLO 30 and FLO 90 have the largest magnitude of rater effects. This lack of 
consistency affects the ability of DLI to compare students whose tests were not graded by 
the same rater. 

An efficient test is made up of questions that discriminate between students of 
different ability and are of different difficulty levels. The application of Item Response 
Theory to the Spanish Subskill test shows that Spanish FLO’s 30 and 40 consist of 
questions that discriminate between students of different abilities. In addition, these tests 
test over a wide range of ability. In contrast, the Spanish FLO’s 60, 70, 80, and 90 consist 
of questions that do not discriminate between students of different ability levels, nor do 
they test over a wide range of ability. The questions that do not discriminate create 
unnecessary variance in the data and provide no information about the student’s abilities. 

Through the use of multiple correlation and principle components, it is determined 
that the Spanish FLO 40 and Russian FLO 30 can be removed fi-om the battery of tests 
given to graduating students with less than a 5 percent loss of variance in the data about 
the students. This will allow for savings in the cost of administering the tests, including 
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the cost of grading the tests as well as a reduction in the time required for the students to 
take the tests. 

Each of the above findings should be addressed to ensure that the Subskill tests 
provide consistent, efficient data to be used to compare students and determine if students 
have met the training objectives. A change in any one of the three areas will effect the 
remaining two. It is recommended that the grading inconsistency be corrected first and 
the removal or elimination of any test be the last of the three changes. It will be necessary 
to record the students’ scores on individual questions to make these changes. By 
recording this information DLI will be able to better monitor the Subskill tests in the 
future. 
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I. INTRODUCTION 


A. BACKGROUND 

The Defense Language Institute (DLI) is responsible for training military and 
government service personnel requiring a foreign language skill. The National Security 
Agency (NSA) and the Defense Intelligence Agency (DIA) set the standards for the vast 
majority of students in the Defense Foreign Language Program. In the early 1990’s these 
two communities developed specific training objectives for students entering professional 
fields in intelligence. DLI was able to combine the requirements from both communities 
into a single set of program objectives for all students. These program objectives are 
referred to as Final Learning Objectives or FLO’s. Subskill tests were developed by DLI 
to be used with the Defense Language Proficiency Tests (DLPT’s) to evaluate whether or 
not the graduating students have met these objectives. (DLI, 1995) These Subskill tests 
are referred to as FLO 10, FLO 20,..., FLO 100. The DLPT’s have been used since 1958 
to evaluate military personnel’s language proficiency. Military personnel are given the 
DLPT’s prior to graduation and throughout their careers. The results of the DLPT’s are 
used to award incentive pay for those in billets requiring language skills to ensure that they 
remain proficient. 

Because Subskill tests are much newer than the DLPT’s, they have not yet been 
evaluated. This thesis provides the first such evaluation. It will focus on three issues: the 
consistency of grading of Subskill tests among raters, whether the tests can distinguish 
between students of differing abilities, and the degree of redundancy in the combined FLO 
DLPT battery of tests. 
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B. AREA OF RESEARCH 


1. Grading Consistency 

For a test to be useful as a comparison tool it is necessary for the grading to 
remain consistent without regard to who graded it. In six of the ten Subskill tests, the 
students respond in English. These tests are graded at the Test Management Center by any 
one of three GS-5’s employed as raters. The remaining four tests are graded at the 
specific language school. Raters use an answer key to grade these tests, interpreting the 
correctness of the student’s response. Normally, only one rater grades a test, which can 
lead to different scores depending on which rater graded the test. For this study, to 
determine if there was a rater effect, all three raters independently graded each student’s 
tests. Over a period of a month, each rater was given all the English response tests to 
grade. The raters do not make marks on the actual answer sheet so it was possible to 
ensure that the raters did not know that the study was being conducted. The result of this 
data collection was one computer scoring sheet from each rater for each student’s test. 
Figure 1.1 shows an example from the data of 10 students’ test scores assigned by 
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different raters from one of the Subskill tests. From this figure it is clear that for these 
students and for this Subskill test, rater 2’s scores are consistently higher than the other 
two. In fact, rater 2 scored student number seven 30 points higher than rater 3. 

2. Ability Range Tested 

The purpose of giving the Subskill tests is to determine if the students have met the 
Final Learning Objectives. To do this it is necessary for the tests to differentiate between 
students of different abilities and to test over a wide range of abilities. In classical test 
theory, where composite scores are used, no consideration is given to the diflSculty or 
discriminating power of a test question when grading a test. In Item Response Theory 
(IRT) each test question can be evaluated for its difficulty and discriminating power, 
allowing the test developer to determine how much information is provided by each 
question about the student’s ability. Using this information, a test can be constructed to 
test over a wide range of ability or to ensure that the students meet a certain cut-off ability 
level. It also allows the test developer to eliminate questions that provide redundant 
information or no information at all and reduce the length of the test .(Hambleton et al, 
1991) 

Subskill tests are evaluated using IRT to determine how much information is 
provided about the student’s ability. This evaluation shows that two of the Spanish 
Subskill tests consist of questions that do differentiate between students and test over a 
wide range of ability. However, the remaining 4 Spanish Subskill tests graded at the Test 
Management Center mainly consist of questions that nearly all students get correct or 
questions that have no discrimination. The result is that these tests do not differentiate 
between students nor do they test over a wide range of ability. This evaluation will not 
examine the validity of the questions. 
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3. Redundancy of Tests 


Thirteen tests are currently given to students just prior to graduation. A reduction 
in the number of tests given while maintaining nearly the same amount of information 
provided by all the tests would be beneficial in terms of students’ time and DLFs budget. 
Jolliffe (1972,1973) discusses methods of selecting variables (or tests) to remove from 
data sets while still maintaining nearly the same amount of variance. Two of these 
methods employing principle components and multiple correlation are used to show that 
for both Spanish and Russian, one test can be removed with less than a 5 percent loss of 
total variance in the data set. 

C. OVERVIEW 

A brief discussion of the testing and grading procedures as well as the collection of 
data is given in Chapter II. Chapter III addresses inter-rater reliability. In Chapter IV, 

IRT is used to investigate what range of ability students are tested over on 6 of the 10 
Subskill tests. The results of this chapter will allow DLI to decide if test questions should 
be added or removed to meet their testing objectives. Chapter V examines whether the 
removal or elimination of one or more FLO’s administered just prior to graduation will 
result in a significant loss of information about the student’s ability. Specific 
recommendations are given in the final chapter. 
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11. DATA 


A. TESTS/TESTING 

Just prior to graduation, students are given a battery of 13 tests to detemune their 
language proficiency and whether they have met the Final Learning Objectives (FLO’s). 
The DLPT’s are used to determine language proficiency in listening, speaking and reading. 
Subskill tests are used to measure the student’s ability to perform the FLO’s in the target 
language and are referred to as FLO 10, FLO 20,..., FLO 100. The test questions are 
different for each language. Table 2.1 lists all the tests and gives a brief description of 
each. 


Test 

Description 

DLPT Listening (List) 

Listen to the Target Language 

DLPT Reading (Read) 

Read the Target Language 

DLPT Speaking (Speak) 

Speak the Target Language 

FLO 10 

Elicit Biographical Data (Speaking and Listening) 

FLO 20 

Two-way Interpretation (Speaking and Listening) 

FLO 30 * 

Listening; Summarize 

FLO 40 * 

Listening; Answer Questions 

FLO 50 

Passage Transcription 

FLO 60 * 

Number Transcription 

FLO 70 * 

Reading; Printed Texts 

FLO 80 * 

Reading; Handwritten Texts 

FLO 90 * 

Translation; Target Language into English 

FLO 100 

Translation; English into Target Language 


Table 2.1. List of tests and a brief description. Tests with an * are graded at the Test Management 
Center. 
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B. GRADING 


The tests at DLI are either oral or written response tests. The DLPT speaking, 
FLO 10, and FLO 20 tests are oral response tests administered and graded at the target 
language school by a speaker of the target language. DLPT listening and reading are 
multiple choice tests graded by computer , FLO 50 and FLO 100 are short answer tests 
whose responses are also in the target language and graded at the target language school. 
The remaining tests, FLO’s 30, 40,60, 70, 80 and 90, are short answer tests with 
responses in English and are graded at the Test Management Center independent of the 
target language. At the time of this study there were 3 employees who graded the tests at 
the Test Management Center. Raters use only the answer key or protocol, they do not 
have a copy of the material presented to the students or of the questions. Since the tests 
are short answer, determining if a response is correct is subjective. The grader fills out a 
computer scan sheet, recording a 1 for a correct response and 0 for an incorrect response. 
Each student’s test is graded only once. The computer scan sheet and student’s answer 
sheet are stored for a maximum of 3 months and then destroyed due to storage 
constraints. 

C. DATA COLLECTION 

1. Language and Group Selection 

At the beginning of the study DLI presented a list prioritizing the languages that 
had classes graduating in the period between January 1996 and March 1996. January was 
the earliest that data could be gathered once the study was approved and March was 
picked as the end of data collection to ensure sufficient time to conduct the study. 

Spanish and Russian were among the higher priority languages and were picked to be 
studied since classes in both of these languages were graduating in a 3 month time period. 
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For these two languages, all students who graduated during the time period are included in 
the inter-rater reliability study and the Item Response Theory study. No other students 
can be used for those analyses since there is no record of their individual responses to 
specific questions on each Subskill test. 

2. Grading Consistency 

In order to compare graders’ evaluations of the students’ responses, it is necessary 
to have each rater grade each test. Due to timing and funding limitations, data was 
gathered on Spanish students tested in January and Russian students tested in February 
1996. The Subskill tests that were studied included FLO 30, 40, 60, 70, 80 and 90. The 
answers of each student were graded separately by each of the three raters. The raters 
were not informed of the study until after the data was collected. The results of the data 
collection were three test scores for each of 56 Spanish students and 29 Russian students. 

3. Ability Range Tested 

The ERT study requires the students’ responses to individual questions. Since the 
score sheets are destroyed after 3 months, the data fi-om the grading consistency study is 
used. Also for this study the grades from only one grader are used in order to minimize 
differences in grading criteria. The scores from the most experienced grader are used for 
this portion of the analysis. Again only FLO 30, 40, 60, 70, 80 and 90 were studied. 

4. Redundancy of Tests 

A larger data set is needed to examine the redundancy of DLPT’s and FLO tests. 
Data was extracted from the DLI data base on all students graduating from the Spanish 
and Russian schools October 1994 to March 1996. This data set includes scores on all 
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tests listed in Table 1. Test scores older than October 1994 are from a different version of 
the DLPT than is currently used and therefore are not included in the analysis. The data 
used in this portion of the analysis consists of the 426 Spanish students and 262 Russian 
students. The original data set contained 529 Spanish and 349 Russian students, but due 
to missing tests scores, many students in the data set can not have their scores used for the 
analysis. The large number of missing test scores is attributed to students going to their 
next duty assignment prior to taking all of the tests. In addition, if students miss a Subskill 
test for another reason, there is no strong requirement for them to make it up, and no data 
is available for these students. Since nothing is known about the students who do not 
take the tests, the results of the analysis can only be applied to the students who do take 
the test and not the general population. 
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III. INTER-RATER RELIABILITY 


A. RELIABILITY OF RATERS 

For scores on a test to be a useful comparison tool, it is necessary for grading to 
be consistent, independent of who graded the test. One way that scores from different 
raters may vary is in the rater’s severity. Some raters may tend to give higher scores while 
others may tend to give lower scores (Longford, 1993). Unless the same rater evaluates all 
students, there is a possibility that some of the students will receive positively or 
negatively influenced scores due to the fact that they were graded by a relatively lenient or 
harsh rater (Raymond, 1990). 

Since DLI currently employs 3 raters, any one of which can grade a student’s test, 
it is necessary to ensure that there is no inter-rater reliability problem. In the data gathered 
to study inter-rater reliability, each student’s tests were graded by all three raters. To 
account for the effect of student ability, students are used as a blocking factor in the 
analysis. First a nonparametric ANOVA method, the Friedman test, is applied to 
determine if there is a rater effect for each of the FLO’s. Once it has been determined that 
there is a rater effect, the size of this effect is estimated using Two-way ANOVA. 

B. ANALYSIS 

1. Non-parametric ANOVA 
a. Methodology 

As mentioned in Chapter II, the data for the first part of the analysis is in 
the form of one observation per rater for each student. This format of data is well suited 
to the Friedman’s Test for a randomized block experiment. The hypotheses of interest for 
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this test is that there is no rater effect. The data is put into aJX S matrix with each 
column representing a rater or treatment effect i, i=l,2,3, and the rows represent each 
student The scores are ranked across rows with ties receiving midranks. Let 

R. be the sum of the ranks for column / then the test statistic for the Friedman test is 

K= — - —y/?,-37(7 + 1) (3.1) 

77(7 + 1) tr 


where 1=3. 

Under the null hypothesis that there is no rater effect, the test statistic 7+ 
has approximately a chi-squared distribution with I-l degrees of freedom. If the resulting 
p-value is small enough, the null hypothesis, that there is no difference between raters, can 
be rejected. (DeVore, 1995) 

b. Results for Friedman Test 

The results of the Friedman test are shown in Table 3.1. As can be seen 


Test 

Spanish 

Russian 

FLO 30 

0.0000 

0.0000 

FLO 40 

0.0004 

0.0035 

FLO 60 

0.3620 

0.3385 

FLO 70 

0.0000 

0.0492 

FLO 80 

0.0000 

0.0000 

FLO 90 

0.0000 

0.0073 


Table 3.1. P-values for the Friedman test for a randomized block experiment. P-values less 
than 0.05 indicate a significant rater effect for that test. 
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from the table, FLO 60 in both languages are the only tests that have no significant rater 
effect. The Russian FLO 70 has a higher p-values (0.0492) than the remaining tests, but 
with a significance value of 0.05 the null hypothesis of no rater effect can be rejected. All 
of the remaining tests in both languages have p-values small enough to reject the null 
hypothesis. 

2. Two-Way ANOVA 

a. Methodology 

A two-way additive ANOVA model is used to estimate the magnitude of 
the rater effects. The factors in the model are the rater and the blocking factor for student 
effects. Since there is only one observation per cell, it is necessary to assume that there is 
no interaction between the students and the raters. This presumes, for example, that one 
rater isn’t more lenient with poor students than another. If an interaction term is included 
in the model, the model would be overparameterized. This analysis is still useful since the 
presence of a nonzero interaction only reduces the probability that the test will be 
significant for rater effects (Lindman, R.,1992). 

b. Results of Two-Way ANOVA Model 

The results of the Two-Way ANOVA model are shown in Table 3.2. As 
with the Friedman ANOVA, a p-value less than 0.05 allows for the rejection of the null 
hypothesis that there is no rater effect. These results are consistent with those in the 
previous section and show that there is a significant rater effect for all tests except the 
FLO 60’s. Examination of the residuals by rater and as a function of the fitted scores 
support the usual ANOVA assumptions of Normality and equal variance. Further, 


11 


i 




Test 

Spanish 

Russian 

FLO 30 

0.0000 

0.0000 

FLO 40 

0.0001 

0.0019 

FLO 60 

0.4318 

0.2697 

FLO 70 

0.0000 

0.03132 

FLO 80 

0.0000 

0.0000 

FLO 90 

0.0000 

0.0010 


Table 3.2. P-values for two-way ANOVA. Values less than 0.05 indicate significant 
rater effect. 


plots of residuals versus fitted values for each rater do not indicate that there is 
interaction between students and rater. 

Now that it has been determined that the majority of the tests have 
significant rater effects, the size of these effects can be estimated. In this model, the 
effects are parameterized so that 

a,=E[X,}-E[X} f=l,2,3 (3.2) 

where X^ is the average score for the f rater and X is the grand mean. For example, a 
+10 effect indicates that the rater’s expected grades are 10 points more lenient compared 
to the expected grade averaged over all three raters. The results of the Two-Way 
ANOVA Model are shown in Figures 3.1 and 3.2 (Appendix A contains the actual values 
for the effects and standard deviation of each rater on each FLO). From these two figures 
it is obvious that the Spanish FLO’s 30 and 90 have the largest rater effect and should be 
investigated first. The magnitude of the rater effects on the Spanish FLO 80 and Russian 
FLO 70 are small and may even be acceptable to the Test Management Center. 
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points more leniently on a test compared to a rater with no effect. 



points more leniently on a test compared to a rater with no effect. 


C. DISCUSSION OF RESULTS 

From both the Friedman and Two-Way ANOVA tests, the FLO 60’s are the only 
tests not having a significant rater effect. FLO 60’s are the only tests with numbers for 
answers, which requires little if no subjective interpretation on the part of the raters. 

Recall that the raters only have the answer key or “protocol” to determine if an answer is 
correct or incorrect. Unlike FLO 60, the remaining Subskill tests have answers consisting 
of words and sentences. The two tests with the largest rater effects are Spanish FLO’s 30 
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and 90. The FLO 30 requires the student to summarize what they have heard and the 
FLO 90 requires the student to translate from the target language into English. Both of 
these tests seem to require the student to make some decision as to what is important 
which may be different from the protocol. The remaining tests ask questions or require 
translation from written texts. These tests may give the student more direction toward a 
correct answer or what the protocol is looking for. The tests are different for each 
language which explains why the Russian and Spanish effects are different. 

For the tests with effects that DLI finds the Test Management Center can either 
attempt to compensate for the effects or eliminate them. To compensate for the effects, it 
would be necessary to record which rater graded each student’s test and add or subtract 
the rater’s effect from the student’s score. This is possible since there is a block on the 
computer scoring sheet to indicate which rater graded the test. The method would be 
effective as long as there is no change in the raters or a change in a rater’s effect. Since 
this is unlikely, it is more useful and effective to reduce or get rid of the rater effects. 

One step in reducing the rater effect is to identify questions whose answers are not 
well defined and require too much subjective evaluation. Once these questions are 
identified, the answer key could be rewritten with more specific guidelines to ensure 
consistent evaluation of the correctness of the answer. Another way to reduce the rater 
effect may be to include the translation of what the student is presented with as well as the 
question itself By allowing the rater to have this information, he should be able to better 
decide if the student understood the material. In addition, each question could be graded 
on a scale instead of a 0 or a 1. This would allow the rater to give partial credit to a 
student who understood the main idea but could not answer the full question. 

Finally, the raters should undergo periodic training to reinforce proper grading 
criteria. The implementation of the suggestions will increase the time required to grade 
the tests but will create a more stable data base which can be used for future analysis. 
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IV. SUBSKILL TEST RESULTS ANALYSIS 


A. COMPOSITE TEST SCORE ANALYSIS 

In general, the only data that is recorded at DLI from each of the Subskill tests is 
the composite score. Normally no record is kept in the data base of the individual 
question scores. Since each question recieves the same number of points for a correct 
response, there is no difference in points awarded for difficult questions compared to easy 
questions. A student who answers a difficult question correctly but misses an easy 
question recieves the same score as a student of lower ability who answers the difficult 
question incorrectly but gets the easy question correct. By keeping only composite 
scores, DLI has no way of knowing if the Subskill tests are differentiating between 
students of different abilities. Figure 4.1 shows the histogram of the grades on the Spanish 


Histogram of Grade Distribution for Spanish FLO 30 



0 20 40 60 80 100 

Composite Scores 


Figure 4.1. Histogram of Composite Scores for Spanish FLO 30. Score is in percent. 

FLO 30 (To reduce rater effect, all scores used in this chapter are from rater 1). By 
examining this figure it is impossible to determine whether the test measures over a wide 
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range of ability and is correctly discriminating between students of different ability or if the 
test is poorly worded and the students all have the same ability. It is therefore necessary 
to examine the questions that make up each test for their individual difficulty and how well 
they discriminate between students of different abilities. A question discriminates well if it 
is useful for separating students into different ability levels. In Figures 4.2 and 4.3, the 
scores for two questions (1 for correct and 0 for incorrect) is plotted against the 
composite test score for each student. Figure 4.2 shows a question with poor 
discrimination and Figure 4.3 shows a question with good discrimination. 


Poorly Discriminating Test Question 


Score 



Composite Test Score (%) 


Figure 4.2. A test question with poor discrimination. 
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Discriminating Test Question 


Score 



Composite Test Score (%) 


Figure 4.3. A test question with good discrimination. 

From Figure 4.2 it can be seen that a student’s composite score is not related to his score 
on the question and thus the question cannot discriminate between students of high and 
low ability. In contrast, a student’s composite score is directly related to his question 
score in Figure 4.3. This question discriminates well between two levels of student 
ability. 

Two hypothetical tests further show that it is difficult to determine if a test is made 
of discriminating questions by examing only the composite scores. Figure 4.4 shows 
histograms of the two hypothetical tests, one made entirely of poorly discriminating 
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Test Made of Poorly Discriminating Questions 



Test Made of Discriminating Questions 



Composita Test Scor* (%) 

Figure 4,4. Histograms of two hypothetical tests, one made of all poorly discriminating questions and the 
other of all discriminating questions. The y-axis in both cases is the munber of students in each bin or 
group. 

questions and the other entirely of question that discriminate well. Scores are evenly 
distributed for the first test, which may lead an evaluator to incorrectly believe that the test 
is made of questions that test across the entire ability range of the class. Although the 
second test sharply divides students into two groups, it provides no further information 
about a students’ ability. By mixing these two types of questions and changing the 
difficulty level of each question, it is possible to construct a test whose histogram has 
nearly any shape. 

Item Response Theory (IRT) examines the characteristics of the questions that 
make up the test rather than just the total score. By determining the difficulty and 
discrimination ability of each question, a test can be constructed to evaluate students over 
a desired ability range or to ensure that a certain cutoff ability is attained by including 
questions of a variety of difficulty. By estimating the discrimination factors, questions 
that are poorly constructed can be eliminated from the test to ensure that the student’s 
true ability is measured. 
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B. ITEM RESPONSE THEORY 


1. Item Characteristic Curve 

“Item response theory rests on two basic postulates: (a) The performance of an 
examinee on a test item(question) can be predicted by a set of factors called ... and (b) the 
relationship between examinees’ item (question) performance and the set of traits 
underlying item performance can be described by a monotonically increasing function 
called an item characteristic curve (ICC)” (Hambleton et al,1991). In other words, if the 
trait used to predict the probability of the student getting a correct response is the 
student’s ability to perform a task, then as the ability of the examinee increases, the 
probability of responding correctly to the question increases. The ICC is defined by 
plotting the proportion of correct responses (or probability of answering correct) for each 
ability level and fitting a smooth curve to those points. Figure 4.5 shows two questions 
with different difficulty. 



Figure 4.5. Item Characteristic Cmve for two questions of different difficulty. 

The question’s difficulty is defined as the ability level where the probability of students of 
that ability getting a correct response is 0.5. For question 1 the difficulty is estimated to 
be 1 while for question 2 the difficulty is estimated at -1.0. The discrimination ability of 
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the question is proportional to the slope of the curve at the point where the probability of 
getting the question correct is 0.5. Figure 4.6 shows two ICCs with different 
discrimination abilities. 



Figure 4.6. Item characteristic curves for two questions with different discrimination 
factors. Question 1 has better discrimination than question 2. 

2. Types of Models 

a. One Parameter Logistic Model 

The one parameter logistic model is one of the more widely used IRT 
models. The item characteristic curves for the one-parameter model is given by the 
equation 


where 


m)= 


^ 1, 2,..., n 


(4.1) 


Pf{6) is the probability that a randomly chosen student with ability 
6 answers question / correctly, 
bi is the question / difficulty parameter, 
n is the number of questions in the test. 
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The difficulty parameter bi is the ability 0 where the probability of getting a 
correct response is 0.5. The higher the hi value, the higher the ability required for the 
student to get the question correct. The difficulty parameter can vary fi-om -oo to +oo 
depending on the scale used for ability but is usually between -3.0 to + 3.0. A question 
with a difficulty parameter of 2.0 would be considered very difficult while one with that of 
-2.0 would be considered very easy. (Hambleton et al, 1991). 

The one-parameter model assumes that all questions have the same 
discriminating value. There are no other item characteristics that define the question. In 
addition, there is no consideration that the student might guess at an answer, which would 
be possible on a multiple choice test. 

b. Two-Parameter Logistic Model 

To account for differences in the discriminating ability of questions, the 
two-parameter model was first developed by Lord (1952) and was based on the 
cumulative normal distribution. A similar and more commonly used model introduced by 
Bimbaum is to substitute the two parameter logistic function for the two parameter ogive 
function as the form of the ICC. (Hambleton et al, 1991) The item characteristic curves 
for the two-parameter logistic model are given by the equation 






i=l,2. 


n. 


(4.2) 


In this model the parameters are defined as in the one-parameter model 
with new parameter a, called the item discrimination factor and a scaling factor D. The 
item discrimination factor or parameter a, is proportional to the slope of the ICC at the 
point bi on the ability scale. The higher the value of Oi, the more discriminating the 
question. Questions with higher discrimination values are more useful for separating 
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students into different ability levels while a question with the discrimination factor near 
zero would provide no discrimination. The value of a, is usually between 0 and +2.0. D 
is a scaling factor equal to 1.7 when the logistic model is used. Like the one parameter 
model, the two parameter model does not provide for guessing. (Hambleton et al, 1991) 

c. Three Parameter Model 

The three-parameter model adds a factor that takes into account the 
possibility for the student to guess the correct answer. This model is most useful for tests 
with multiple choice items and is not used in this study since none of the tests are multiple 
choice. For a full discussion of the three-parameter model see Hambleton et al (1991). 

C. FITTING THE MODEL 

1. Model Used 

The model that is used for analyzing the tests is called the two-parameter model. In 
fact, there are more than two parameters to be estimated in this model since an ability for 
each student must be estimated along with the two parameters for each question. The 
probability of getting a correct response on question i for the two-parameter model given 
in Equation 4.2 is equivalent to the logistic regression model 


where 


r p. ^ 
V,i--^y 


= DaXe-b,). 


(4.3) 
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When student ability 0 is also unknown and considered a parameter to be estimated, the 
logit of the probability of a correct response (Equation 4.3) is not linear in the parameters. 
This model does not fall into the generalized linear model framework (eg McCullough and 
Nelder (1989)) and thus the usual packages for fitting logistic regression models are not 
directly applicable for estimating the question parameters a, and b, and ability 0. 

Techniques to estimate both the students’ ability parameters and the question’s 
parameters simultaneously are iterative. They require that an estimate be made of either 
the students’ ability parameters or the question’s parameters to start with. Both 
Hambleton( 1983) and Baker (1992) suggest starting with an estimate of the student’s 
ability parameters such as the normalized test scores. The normalized scores for all the 
students are used to estimate the question parameters and then these item parameters are 
used in an iterative process to estimate the student’s ability. Methods for estimating 
ability and question parameters are discussed in the following sections. These steps are 
repeated until the change of parameters between each estimation of student’s abilities and 
question parameters is within acceptable limits. 

2. Item Parameter Estimation 

The parameter estimation for each question is made using the data, the estimate of 
each students ability, and the students score on each question (0 or 1). When 0 is known 
the logit in Equation 4.3 is linear in the remaining parameters making them easily 
estimated using a logistic regression. This generalized linear model is fit separately for 
each question giving a Maximum Likelihood Estimator (MLE) for a, and bi. Students’ 
scores on a question are the response variables and their estimated ability is the prediction 
variable. Once obtained, these estimates are used in place of the question parameters and 
the student’s abilities are estimated as described in the next section. This is repeated until 
convergence of all parameter estimates is obtained. 
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3. 


Student Ability Parameter Estimation 


Although it is tempting to stop once the question parameters have been estimated, 
it is necessary to continue and estimate the ability parameters until the difference between 
iterations is below a satisfactory level. Baker (1992) derives the following formula using 
the first and second derivatives of the likelihood function to iteratively solve for the 
MLE’s of the ability parameters. To distinguish between students a subscript j is added to 
ability, probability of correct response, and the response variable. Let J be the total 
number of students and. 


Bj, J, be student fs ability, 

Pij=Pi(Bj) be the probability of student j getting question i correct 
and Qirl-Pij- 


If [6J ], is the estimate of Bj at the f iteration then 




i=\ _ 




(4.4) 


where 


Ujj - 0 for an incorrect response from student j on item i 
1 for a correct response from student j on item i. 


Using the estimated question parameters and the previous estimate of ability {Bj ];, Uu Py, 
and Qy are replaced on the right hand side of Equation 4.4 to give the new abilities 

. These new abilities are standardized and then used to give new estimates of the 
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question parameters. This procedure is repeated until the difference between the previous 
estimation of ability and question parameters are small enough. 

4. Preparing the Data 

When estimating parameters for a test it is necessary to remove questions and 
students that will cause problems for the model. Questions that are either missed by all 
students or correctly answered by all students lead to difficulty parameters that are infinite 
and cannot be estimated. In addition, the ability of students who answer all questions 
correctly or miss all questions will approach infinity or negative infinity. (Baker, 1992), 
Similar problems arise for questions that only a few students answer correctly or 
incorrectly and with students who either answer a few questions correctly or incorrectly. 
The data must be screened to remove these types of questions and students prior to 
parameter estimation. Additionally, questions with low discrimination or no 
discrimination will cause the difficulty of the question to approach infinity and must also 
be removed. Once the data is properly screened, the parameter estimation process can 
begin. 


D. MODEL FIT 

1. Screening Results 

The results of screening the Spanish Subskill tests are shown in Table 4.1. Spanish 
FLO’s 30 and 40 require little screening of the data. Spanish FLO’s 60, 70, and 90 
require extensive screening which lead to poor results in the analysis. As the number of 
usable questions decreases it becomes more difficult to estimate student abilities. This in 
turn results in unstable estimates of question parameters. The end result is that the 
algorithm fails to converge and parameters caimot be estimated. Finally, after screening 
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Test 

Total 

Number 

of 

Questions 

Number of 
Questions 
all Correct 

Number 
Less Than 

4 Wrong 

Number 
Less Than 

4 Right 

Number of 
Questions 
With No 
Discrimination 

Questions 

Remaining 

FLO30 

30 

0 

0 

2 

0 

28 

FLO40 

32 

0 

4 


1 

27 

FLO60 

40 

2 

22 

0 

10 

6 

FLO70 

30 

3 

9 

0 

9 

9 

FLO80 

30 

9 

12 

0 

7 

2 

FLO90 

32 

0 

10 

1 

11 

10 


Table 4.1. Results of data screening for the Spanish Subskill test data. 


Spanish FLO 80 there was not enough data remaining to conduct an analysis. 

Since there were only 29 students in the Russian data set, the Subskill test data set 
was not large enough to conduct IRT analysis. 


2. Results of IRT Parameter Estimation 


As can be expected from the results of the screening, Spanish FLO’s 30 and 40 
have the best results, while FLO’s 60 and 70 have the worst. Table 4.2 shows how many 


Test 

Questions Remaining 
after screening 

Questions Successfully 
Modeled 

FLO 30 

28 

25 

FLO 40 

27 

27 

FLO 60 

6 

2 

FLO 70 

9 

5 

FLO 80 

2 

0 

FLO 90 

10 

6 


Table 4.2. Number of questions successfully modeled for each Spanish FLO. 


questions are successfully modeled using the IRT parameter estimation technique 
described. The estimated values of the parameters for each test can be found in Appendix 
C. 
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3. 


Goodness of Fit 


The item response model given in Equation 4.3 is a first approximation to the 
relationship between a student’s ability and the probability of answering the question 
correctly. Veiy little work has been done on methods for checking the adequacy of an item 
response model. A recent paper suggests fitting a nonparametric item response model 
such as a generalized additive model (Douglas, 1995) and then checking to see how close 
the parametric ICC is to the nonparametric ICC. For Spanish FLO 30 and 40, a 
generalized additive model is fit to each question separately treating the estimated ability 
from the item response model as the explanatory variable, i.e. 


log 




- a. +B^s{0) 


(4.5) 


where s(6) is a “smooth” fixnction of 6Xo be estimated in the fit of the generalized additive 
model. Rather than follow Douglas’ suggestion to plot the parametric ICC versus the 
nonparametric ICC, it is often more revealing to look at some form of residuals, when 
examining the fit of a model. Plots of partial residuals versus ability (Hastie and 
Tibshirani, 1991) are used to asses fit. The partial residuals are given by 


4o-4) 


+ s,{d.) 


(4.6) 


where P^J is the estimate of Py computed from the generalized additive model. A linear 
relationship in the plot of the partial residuals versus ability indicates that s(0) is indeed 
linear and the IRT model is adequate. A nonlinear trend indicates that the ERT model 
probably does not adequately describe the relationship between students ability and the 
probability of answering correctly. 
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From the plots of both the Spanish FLO’s 30 and 40 question it appears 
that about 80% of the questions behave in a linear fashion. However, the remainder of the 
questions show that the relationship between ability and logit of the probability of 
answering the question correctly is not linear. Figure 4.7 shows a plot of the partial 
residuals of a 


Additive Fit with Smoothing 


0 

Partial Residual 

•5 

-10 

-10 12 
Ability 

Additive Fit with Smoothing and Confidence Interval 

10 

0 

Partial Residual 

-20 

-40 

Figure 4.7. Plot of a the partial residuals from a general additive model of a question 
from Spanish FLO 30 showing the lack of a linear relationship. 

smoothed general additive model. From this figure it can be seen that the relationship is 
not linear. After examining the width of the confidence interval band in the lower graph of 
Figure 4.7 it is apparent that care must be taken in evaluating the residual plots. For many 
questions in FLO’s 30 and 40 the confidence band is wide enough that it is not possible to 
tell whether the trend in the partial residual plot is an artifact of variation in the data or an 
indication that the IRT model does not fit. One reason for the large confidence bands are 
the small sample sizes. With a larger sample size, the confidence band will tighten and the 
relationship between 0 and Pi(9) or the logit will be more clear. “The linear model is a 
convenient but crude first-order approximation to the prediction surface, and in many 
cases it is adequate” (Hastie and Tibshirani, 1991). Thus, with the sample size available 



-10 12 
Ability 
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and for the purposes of this look at the discrimination ability of the FLO’s, the IRT model 
is adequate. 

E. INTERPRETING THE MODEL 

1. Information Functions 

One of the more useful aspects of IRT is the information function. When the 
question parameters are known, the amount of information provided by a question at a 
specific ability level can be determined using a question’s information function. The 
information functions fi'om all the questions in a test can be combined to determine the 
amount of information provided by the test at each ability level. This can be used to 
construct or modify a test to ensure that a certain cutoff ability level is tested for or to that 
a wide range of ability is covered by the test. The amount of information provided by a 
question is directly related to the discrimination power of that question. Figure 4.8 is a 



Figure 4.8. Information fimctions of two questions of dififerent discrimination 
parameter but the same difficulty parameter. The information values are inversely 
related to the SE(e). 

graph of two questions’ information functions, each question with the same difficulty but 
different discrimination parameters. The peak of each function occurs at the same point 
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on the x-axis (ability), but the question with poorer discrimination does not provide as 
much information. The item information function is given by the equation 




P<(e)QAe) 


/= 1, 2 ,.... n. 


(4.7) 


By summing the information functions over questions in a test, the test information 
function is given by 


( 4 . 8 ) 

1=1 

The amount of information provided by a test at ^is inversely related to the precision with 
which the ability is estimated at that point (Hambleton et all, 1991). Using the relationship 

(4.9) 

where SE0) is the standard error of it is possible to determine the information 
required for a specific precision level. By comparing plots of the desired information level 
to the actual information provided by a test it is possible to determine which types of 
questions need to be added to the test to achieve that level. Figure 4.9 shows the graph of 
a sample test’s information function and the information function with SE equal to 0.5. 
From this figure it can be determined that both easier and more diflScult questions need to 
be added to achieve the precision level of 0.5 over an ability range of -2.0 to +2.0. 
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Sample Test Information Function and Desired 
Information Function 



Figure 4.9. Graph of desired information function and sample test information 
function. The information values are inversely related to the SE(0). 


2. Actual Information Functions 


Figure 4.10 shows the information functions for both Spanish FLO’s 30 and 40 
with an additional curve representing desired precision level of SE=0.5 for comparison. 
Figure 4.11 shows the information functions for Spanish FLO’s 60, 70, and 90 also with 
desired curve representing precision levels of SE=0.5. 


Information Functions for Spanish FLO'sSO and 40 



Figure 4.10. Information fimctions for Spanish FLO’s 30 and 40 with desired precision 
curves of SE=0.5. The information values are inversely related to the SE(0). 







desired curve representing SE=0.5. The information values are inversely related to the SE(0). 

F. DISCUSSION OF RESULTS 

The results of the data screening reveal that for Spanish FLO’s 60, 70, 80 and 90, 
many of the questions are not providing information to differentiate between students. As 
seen in Table 4.1, the majority of the questions are either answered correctly by all the 
students or provide no discrimation between students of different abiltiy. The few 
questions that do remain provided limited information about the student’s ability as can be 
seen in Figure 4.11. From these results, many of the questions on these tests can be 
replaced with more difficult and better discriminating questions to provide precise 
information about the student’s ability. One exception to this may be with Spanish FLO 
60 which tests number transcription ability at a specific rate. For this type of test it may be 
difficult to vary the difficulty of the questions. Since the students at DLI are heavily 
exposed to numbers throughout their training, they may all be able to transcribe at the 
desired rate. The questions that show poor discrimination should still be examined to 
determine if they should be removed. 

The two tests that modeled well are Spanish FLO 30 and 40. The questions on 
these tests covered a broad range of ability and most have sufficient discrimination ability. 
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of a precision level equivalent to a SE of 0.5 is desired it would be necessary to add 
questions in the -2.5 to the -1.0 difficulty level for FLO 30. 

When discussing the results of IRT analysis, it must be remembered that it is 
assumed that the data used is representative of the population that takes the test. If DLI 
plans on using the Subskill tests to evaluate field personnel as it does with the DLPT, it 
will be necessary to have field personnel take the tests and evaluate their individual 
question responses. The inclusion of field personnel should decrease the difficulty level of 
the questions since adding the field personnel will change the population and should result 
in a higher test average for the tests. For questions that the current population of test 
takers found difficult the difficulty will be lowered on a standardized scale. 

In general, the ERT analysis shows that many of the questions on the Subskill tests 
can be eliminated without affecting the determination of the student’s ability. Using IRT, 
it is possible to determine how many of these question need to be replaced with more 
difficult and better discriminating questions to achieve a desired precision level. 
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V. TEST REDUCTION 


A. MOTIVATION 

1. Background 

Just prior to graduation students at DLI take the 10 Subskill tests and 3 Defense 
Language Proficiency Tests (DLPTs). As mentioned previously, 6 of the 10 Subskill tests 
are graded at the Test Management Center by any one of the three GS-5 employees whose 
job is to grade these tests. The remaining 4 Subskill tests are graded at the respective 
language school. Each rater can grade an average of 7 tests an hour, requiring about 26 
hours to grade all six of the Subskill tests for a class of size 30. Since both the DLPTs and 
the Subskill tests determines the students’ ability to read, speak and listen to the target 
language, it is possible that the tests will measure overlapping abilities and that at least one 
of the tests may be eliminated. A reduction in the number of Subskill tests would result in 
a monetary savings by reducing the workload for the raters, test proctors, and students. 
The sponsors of the Subskill tests are not willing to reduce the number of Subskill tests 
unless it can be shown that no significant loss of information about the students’ ability to 
perform the Final Learning Objectives will occur. The DLPTs are not being considered 
for elimination since they are used to evaluate both students and field personnel 
proficiency levels. 

2. Methods to be Employed 

Principle components and a second method based on multiple correlation both 
presented by Jolliffe (1986) are used to determine which Subskill tests can be removed. 
The goal of both methods is to eliminate only those tests that reduce the amount of 
information in the tests by a small amount. In other words, remove those tests which are 
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highly collinear with a linear combination of the remaining tests. Because the two 
methods use different criteria to select tests to remove, they may select tests in a different 
order and sometimes different tests altogether. 

3. Data Examined 

To continue with the analysis of the previous two chapters, these methods are 
applied to Spanish and Russians students graduating between October 1994 and March 
1996. Variability in test scores can be caused by differences in ability and by tests with 
poorly discriminating questions. From Chapter 4 it was shown that Spanish FLO 60 had 
10 poorly discriminating questions. To determine if removing the 10 poorly discriminating 
questions will affect the order or selection of tests to be removed and the amount of 
variance still explained by the remaining tests, two additional data sets are studied. The 
two data sets are constructed using the scores of the Spanish students studied in the 
previous two chapters. The first data set is constructed from the data as retrieved from 
DLLs data base. The second data set is the same as the previous with the exception of the 
FLO 60 data. The 10 poorly discriminating questions are removed from the test and the 
students’ grades are then recomputed. 

B. METHODOLOGY 

1. The Principal Component Method 

Principle component analysis is a common method for reducing the dimensions of 
a data set while retaining as much as possible of the variation in the original data set. “The 
reduction is achieved by transforming to a new set of variables, the principle components, 
which are uncorrelated, and which are ordered so that the first few retain most of the 
variation present in all of the original variables” (Jolliffe, 1986). One problem with using 
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principle components to reduce the dimensions of a data set is that they are often linear 
combinations of all of the variables (Dunteman, 1983). This would defeat the purpose of 
reducing the dimensions of the data set. However, principle components can still be used 
to determine which variables to remove fi'om the data set. In a method referred to as the 
B2 model a principle component analysis is performed on all the original K variables and 
the eigenvalues are inspected. For all principle components whose eigenvalues are less 
than some hi, the corresponding eigenvectors are inspected starting with the vector with 
the smallest eigenvalue and continuing with the vector having the next smallest until is 
reached. For each vector, the variable associated with the largest component of the vector 
is removed from the data set. If the variable has already been removed, the variable 
associated with the next largest component is removed. Jolliffe (1972) recommends a 
value for hi of 0.7. This value leads to a data set that only retains 60-70 % of the original 
data set’s variance, which is not an acceptable amount. Therefore, in this analysis 
variables are removed until less than 90% of the original variance is still explained by the 
remaining variables. (Jolliffe, 1986) 

2. The Multiple Correlation Method 

A closely related method to remove variables from a data set uses multiple 
correlation. This method explained by Jolliffe (1972) uses multiple correlation to pick p 
variables to describe the variance of the original K variables. “Method A2 is a step-wise 
method which first rejects that variable which has a maximum multiple correlation with the 
remaining .^-1 variables. Then at each stage, when q variables remain, the variable having 
the largest multiple correlation with the other q-\ variables is rejected.” The process 
continues until p variables remain. For this analysis, the process continues until the less 
than 90% of the original variance is explained by the remaining p variables. 
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3. 


Comparing Methods 


To determine the effectiveness of each method it is necessary to measure the 
amount of variance remaining after the removal of a test from the data set. Jolliffe (1973) 
uses the following formula to compute the remaining variance in a test; 

P^tr. 

= _(5.1) 

K 

where p = the number of remaining tests or variables 

r, = the multiple correlation factor of removed test / and the 
remaining tests 

K = the original number of tests. 


C. RESULTS 

The A2 and B2 methods were first applied to all Spanish and Russian students 
graduating between October 1994 and March 1996. The results for the Spanish students 
are shown in Table 5.1 and for the Russian students in Table 5.2. There were originally 
529 Spanish students and 349 Russian students in the data base. Due to missing test 
scores in the data base, the data set was reduced to 426 Spanish and 262 Russian students. 
The results for the Spanish data set show that both methods remove FLO 40 first followed 
by FLO 30. Removing any more of the Subskill tests reduces the explained variance to 
below 90 %, which was the desired cutoff level. The results for the Russian data set are 
different than that of the Spanish but also indicate that up to two Subskill tests can be 
removed without going below the desired 90 % explained variance. In addition the 
methods suggest a different order to remove the tests but pick the same two to be 
removed when removing two tests. 
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Number of Tests 
Removed 

Method 

Tests Removed 

Variance Explained by the 
Remaining Tests (in percent) 

1 

Multiple 

Correlation 

FLO 40 

96.45 


Principle 

Components 

FLO 40 

96.45 

2 

Multiple 

Correlation 

FLO 40, FLO 30 

91.57 


Principle 

Components 

FLO 40, FLO 30 

91.57 

3 

Multiple 

Correlation 

FLO 40, FLO 30, FLO 50 

86.40 


Principle 

Components 

FLO 40, FLO 30, FLO 50 

86.40 


Table 5.1. Results of variable reduction using data on Spanish students graduating between October 1994 
and March 1996. 


Number of Tests 
Removed 

Method 

Tests Removed 

Variance Explained by the 
Remaining Tests (in percent) 

1 

Multiple 

Correlation 

FLO 30 

95.54 


Principle 

Components 

FLO 10 

95.38 

2 

Multiple 

Correlation 

FLO 30, FLO 10 

90.86 


Principle 

Components 

FLO 10, FLO 30 

90.86 

3 

Multiple 

Correlation 

FLO 30, FLO 10, FLO 40 

85.80 


Principle 

Components 

FLO 10, FLO 30, FLO 50 

85.27 


Table 5.2. Results of variable reduction using data on Russian students graduating between October 1994 
and March 1996. 


Next, the methods were applied to a smaller subset of the Spanish students to 
determine if removing the 10 poorly discriminating questions from the Spanish FLO 60 
would affect the results. Table 5.3 shows the results of applying both methods to the 
subset without changing the FLO 60. Table 5.4 shows the results of both methods with 
the 10 questions removed. For both sets of data it is possible to remove three Subskill 
tests without losing more than 10 % of the variance. For the unaltered FLO 60 data set, 
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Number of Tests 
Removed 

Method 

Tests Removed 

Variance Explained by the 
Remaining Tests (in percent) 

1 

Multiple 

Correlation 

FLO 40 

98.27 


Principle 

Components 

FLO 30 

97.97 

2 

Multiple 

Correlation 

FLO 40, FLO 70 

95.09 


Principle 

Components 

FLO 30, FLO 40 

93.89 

3 

Multiple 

Correlation 

FLO 40, FLO 70, FLO 30 

90.52 


Principle 

Components 

FLO 30, FLO 40, FLO 70 

90.52 

4 

Multiple 

Correlation 

FLO 40, FLO 70, FLO 30, FLO 90 

86.97 


Principle 

Components 

FLO 30, FLO 40, FLO 70, FLO 90 

86.97 


Table 5.3. Results of variable reduction using data from DLI database on Spanish students studied in the 
previous two chapters. 


Number of 

Tests Removed 

Method 

Tests Removed 

i 

Variance Explained by the 
Remaining Tests (in percent) 

1 

Multiple 

Correlation 

FLO 40 

97.95 


Principle 

Components 

FLO 30 

97.92 

2 

Multiple 

Correlation 

FLO 40, FLO 30 

95.47 


Principle 

Components 

FLO 30, FLO 40 

95.47 

3 

Multiple 

Correlation 

FLO 40, FLO 30, FLO 90 

92.37 


Principle 

Components 

FLO 30, FLO 40, FLO 70 

91.76 

4 

Multiple 

Correlation 

FLO 40, FLO 30, FLO 90, FLO 70 

88.30 


Principle 

Components 

FLO 30, FLO 40, FLO 70, FLO 90 

88.30 


Table 5.4. Results of variable reduction using data from DLI database on Spanish students studied in the 
previous two chapters with the FLO 60 modified by removing the 10 poorly discriminating questions. 


the methods select the same three variables but in a different order as did the Russian. 
Finally, using the modified data set, both methods select FLO’s 30 and 40 as the first two 
tests to be removed but then select different tests to be removed after that. 
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D. DISCUSSION OF RESULTS 


The three features to examine in the results of the variable reduction methods are 
the difference between the methods, the effects of removing the 10 questions from Spanish 
FLO 60, and the number of tests that can be removed. 

The results from Table 5.1 show that there is no difference in methods when 
examining the larger data base of Spanish students. When applied to the Russian data set, 
as well as to the smaller Spanish data set, the methods select the same two Subskill tests, 
but in a different order. In addition, the smaller Spanish data sets allow for the removal of 
a third test. This third test is the same for the unmodified data set but different for the 
modified data set. The reason that the methods select tests in a different order and 
sometimes different tests altogether is that the A2 method (multiple correlation method) 
considers the impact of removing a test on the remaining data set. The B2 method 
(principle components method) focuses on removing the test that explains the most 
variance in that principle component without considering how much that test contributes 
to the variance in the remaining principle components. Because of this, the B2 method, 
when it differs from the A2 method, will not always select the best test to remove. Jolliffe 
(1972) also found that the A2 method was the better of the two methods but felt that 
neither method was notably better or worse than the other for artificial data. 

The next item to examine is the effect of removing the 10 questions from the 
Spanish FLO 60 in the small data set. Recall that the purpose of removing the 10 poorly 
discriminating questions was to see if, by removing some of the variability in that test if the 
order or selection of tests to be removed and the amount of variance still explained by the 
remaining tests would be affected. As seen in Tables 5.3 and 5.4 there was little effect on 
the amount of variance explained after the same number of tests had been removed. There 
was a difference in selection of tests to be removed using the A2 method. FLO 70 is 
selected to removed second followed by FLO 30 third in the unmodified data set whereas 
FLO 30 is selected second followed by FLO 90 third when examining the modified data 
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set. The end result is that there is less than a 1 percent difference in explained variablity 
caused by removing the 10 questions. This indicates that the 10 questions do not account 
for a large amount of the variance present in the data set. 

Finally, by examining the results of all four data set it can be seen that at least one 
of the Subskill tests can be removed from both languages. FLO 40 could be removed 
from the Spanish and FLO 30 could be removed from the Russian test batteries with no 
significant loss of information about the students. 
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VI. SUMMARY/RECOMMENDATIONS 


A. SUMMARY 

The Defense Language Institute developed the Subskill tests to determine if the 
graduating students have met the specific training objectives as outlined by the National 
Security Agency and the Defense Intelligence Agency. These Subskill tests have been in 
place for over two years and it is now possible to evaluate them and determine if any 
changes should be made to improve their consistency and efficiency. 

For the grades to be consistent, a student should receive the same score 
independent of who grades their test. The analysis of variance shows that assignment of 
scores is not consistent between raters for the majority of Subskill tests in Russian and 
Spanish. In particular the Spanish FLO’s 30 and 90 have the largest rater magnitudes 
allowing a difference in grades as large as 16 points. 

An efficient test is made of questions that discriminate between students of 
differing abilities and are of different difficulty levels. The Spanish FLO’s 30 and 40 tests 
are made of questions that do this. In contrast are the Spanish FLO’s 60, 70, 80 and 90 
which consist of many questions that are either too easy for all the students or do not 
discriminate among students of different ability. The questions that do not discriminate 
create unnecessary variance in the data set and provide no information about the student’s 
ability. 

Finally, although there was surprisingly little redundancy between the Subskill tests 
and the DLPTs, the Subskill tests and the DLPTs provide some redundant information 
about the student’s abilities. The Spanish FLO 40 and Russian FLO 30 can be removed 
with less than a 5% loss of variability in the data set. 
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B. RECOMMENDATIONS 


Each of the above findings should be addressed to ensure that the Subskill tests 
provide consistent, efficient data to be used to compare students and determine if the 
students have met the training objectives. 

The rater effect can be eliminated by reducing the amount of subjective evaluation 
of the student’s responses and allowing the rater to give partial credit. By providing the 
rater with an English translation of the material and questions given to the student the 
rater will more accurately evaluate if the student has understood the question and 
responded correctly. In addition by allowing partial credit, the differences between rater 
evaluations will be smoothed. 

More data on the Spanish FLO’s 60, 70, 80 and 90 should be gathered to 
determine which test questions provide the same information about the student’s ability. 
To do this it will be necessary to record the students’ scores on individual questions. The 
raters currently fill out a computer scan sheet that reads the score for individual questions 
but this information is not recorded or maintained. The Test Management Center should 
maintain this information for future use. Many of the questions on the Spanish FLO’s 60, 
70, 80, and 90 were answered correctly by all students taking the tests. These questions 
should be replaced by questions of greater difficulty so as to provide more information 
about the student’s ability. The questions that do not discriminate should be fixed or 
eliminated to reduce the amount of variance in the tests. 

The number of Subskill tests can be reduced in both Russian and Spanish. This 
will reduce the amount of time spent administering and grading the tests as well as reduce 
the amount of information that needs to be maintained in the database. Consideration 
must be given to the effects of test removal on teaching methods and students’ motivation 
to ensure that students continue obtain the desired language abilities. 

The rater effect issue should be addressed first. By reducing the subjective 
evaluation of the student’s responses, the relationship between the student’s ability and 
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response to a question may change. This could effect the discrimination power of a test 
question which would in turn result in the question providing useful information about the 
student’s ability. By removing the rater effect the variance in the data base may also be 
reduced which could change both the order in which tests are selected to be removed and 
the amount of variance explained by the remaining tests. 

These changes will ensure that the Subskill tests developed by the Defense 
Language Institute will continue to efficiently provide useful consistent information about 
the students’ language abilities. 
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APPENDIX A. RATER EFFECTS SUMMARY TABLE 


Test 

Spanish 

Russian 

Grader 1 

Grader 2 

Grader 3 

Grader 1 

Grader 2 

Grader 3 



10.4365 



2.069 

6.092 






1.686 

1.111 

FLO 60 

0.0298 

-0.1488 

0.1190 

-0.02874 

0.4023 

-0.3736 

FLO 70 

-6.071 

2.143 

3.929 

-1.341 

-0.3065 

1.6475 

FLO 80 

0.3273 

0.6182 

-0.9455 

1.1111 

-4.2912 

3.1801 

FLO 90 

7.9382 

0.4490 

-8.3872 

3.125 

-0.1078 

-3.0172 


Table A-1. Summaiy Table of the rater effects (points on a 100 point test) for both Spanish and Russian 
Subskill tests. 


Test 

Spanish 

Russian 

Grader 1 

Grader 2 

Grader 3 

Grader 1 

Grader 2 

Grader 3 

FLO 30 

4.072407 

4.257052 

4.855999 

5.205443 

4.481238 

6.039411 

FLO 40 

3.461636 

3.378495 

3.716225 

3.798504 

4.558271 

3.72086 

FLO 60 

0.9529895 

0.8333706 

0.9417565 

1.229651 

1.366055 

1.775972 

FLO 70 

4.310946 

3.730607 

3.858146 

4.04319 

3.169194 

3.137371 

FLO 80 

0.7869588 

0.8573562 

0.963506 

2.672612 

2.072839 

3.262801 

FLO 90 

4.405731 

3.409063 

4.266555 

4.148024 

4.132532 

5.953696 


Table A-2. Summary Table of the standard deviation of the rater effects for both Spanish and Russian 
Subskill tests. 
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APPENDIX B. RESULTS OF IRT ANALYSIS 


Test 

FLO 30 

FLO 40 

Question 

Discrimination 

Difficulty 

Discrimination 

Difficulty 

1 



0.34209 

0.60711 

2 

0.65204 

0.81167 

0.76504 

-1.2717 

3 

0.76636 

0.33544 



4 

1.82756 

1.46266 

0.6502 

0.61854 

5 

0.49392 

5.13756 

1.47022 

0.03907 

6 

1.35672 

0.91748 

0.14762 

-2.9153 

7 

0.70393 

2.59003 



8 



0.79932 

0.95586 

9 

0.30809 

1.29823 

1.08681 

-0.4285 

10 

1.42263 

1.33465 

0.96512 

0.70131 

11 

0.25755 

1.72006 

1.00159 

1.40624 

12 

0.23114 

-1.8944 

1.00336 

-0.185 

13 

0.7918 

0.03094 

0.47112 

0.90085 

14 

0.65978 

0.53889 



15 

0.19879 

3.55872 

0.31207 

0.36201 

16 

0.76315 

0.4139 

0.42472 

-0.1745 

17 

0.52602 

-0.5328 

0.22919 

0.09372 

18 





19 

1.52085 

-0.5827 

0.80814 

1.57112 

20 

1.44141 

-0.1557 



21 



0.71779 

-1.8062 

22 

0.87481 

-1.6094 

1.21803 

1.64727 

23 

0.58767 

-0.1353 

0.39674 

0.78116 

24 

0.3892 

2.8992 

1.10136 

-1.6458 

25 

1 


0.53417 

-0.2441 

26 

0.25716 

-1.1477 

0.36638 

0.31357 

27 

0.7535 

0.91575 

1.08834 

-1.0517 

28 

1.33622 

0.38368 

0.89291 

-0.9072 

^■1 

1.0466 

1.89342 

0.80613 

-0.1257 


1.33622 

0.38368 

0.98729 

1.11133 

31 



0.52382 

0.9355 

32 



0.0678 

3.14E-01 


Table B-1. Question difficulty and discrimination parameters for Spanish FLO’s 30 and 40 found using 
IRT analysis. 
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Test FLO 60_FLO 70_ FLO 90 



1^_ 0,724721 -1.16264 1.927979 -1.10155 

14 0.387 -0.47342 _ 



Table B-2. Question difficulty and discrimination parameters for Spanish FLO’s 60, 70 and 90 found 
using IRT analysis. 
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